Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech
Abstract
This research builds an audio-visual corpus of Modern Standard Arabic (MSA). The corpus is annotated both phonetically and visually and is dedicated to emotional speech processing studies. Building the corpus consisted of five main stages: speaker selection, sentence selection, recording, annotation, and evaluation. 500 sentences were carefully selected based on their phonemic distribution. The speaker was instructed to read the same 500 sentences with six emotions (happiness, sadness, fear, anger, inquiry, and neutral). A sample of 50 sentences was selected for annotation. The corpus was evaluated subjectively in three modules: audio, visual, and audio-visual. The evaluation showed that happiness, anger, and inquiry were better recognized visually (94%, 96%, and 96%) than audibly (63.6%, 74%, and 74%), with audio-visual scores of 96%, 89.6%, and 80.8%, respectively. Sadness and fear, on the other hand, were better recognized audibly (76.8% and 97.6%) than visually (58% and 78.8%), with audio-visual scores of 65.6% and 90%.
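The reported recognition rates can be tabulated to make the per-emotion comparison easier to check. The sketch below is only an illustration: the dictionary layout and the function names `better_single_modality` and `best_modality` are assumptions, and it assumes that each list of percentages follows the order in which the emotions are named in the abstract. The abstract reports no score for the neutral condition, so that condition is omitted.

```python
# Recognition rates (%) copied from the abstract. Assumption: within each
# parenthesised list, values follow the order the emotions are named in.
SCORES = {
    "happiness": {"audio": 63.6, "visual": 94.0, "audio_visual": 96.0},
    "anger":     {"audio": 74.0, "visual": 96.0, "audio_visual": 89.6},
    "inquiry":   {"audio": 74.0, "visual": 96.0, "audio_visual": 80.8},
    "sadness":   {"audio": 76.8, "visual": 58.0, "audio_visual": 65.6},
    "fear":      {"audio": 97.6, "visual": 78.8, "audio_visual": 90.0},
}


def better_single_modality(rates):
    """Return 'visual' or 'audio', whichever single modality scored higher."""
    return "visual" if rates["visual"] > rates["audio"] else "audio"


def best_modality(rates):
    """Return the condition (audio, visual, audio_visual) with the top score."""
    return max(rates, key=rates.get)


if __name__ == "__main__":
    for emotion, rates in SCORES.items():
        print(
            f"{emotion:10s} better single modality: "
            f"{better_single_modality(rates):7s} "
            f"best overall: {best_modality(rates)}"
        )
```

Under these assumptions, the combined audio-visual condition is the top scorer only for happiness; for the other four emotions a single modality scores highest.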
Related resources
Designing the Latvian Speech Recognition Corpus
In this paper the authors present the first Latvian speech corpus designed specifically for speech recognition purposes. The paper outlines the decisions made in the corpus design process through analysis of related work on speech corpora creation for different languages. The authors also provide guidelines that were used for the creation of the Latvian speech recognition corpus. The corpus ...
Efficient Diphone Database Creation for MBROLA, a Multilingual Speech Synthesiser
Diphone synthesis is a convenient way of testing phonetic models of human speech. It allows easy manipulation of duration and pitch; therefore it is used not only for general intonation contour evaluation, but also for expressive speech synthesis. The main advantage of using MBROLA [11], [9], [12], [13] is the fact that not all the diphones need to be contained in the voice to test speech models. ...
Design and recording of Czech speech corpus for audio-visual continuous speech recognition
In this paper we describe the design, recording, and content of a large audio-visual speech database intended for training and testing of audio-visual continuous speech recognition systems. The UWB05-HSCAVC database contains high resolution video and quality audio data suitable for experiments on audio-visual speech recognition. The corpus consists of nearly 40 hours of audiovisual records of 1...
The MMASCS multi-modal annotated synchronous corpus of audio, video, facial motion and tongue motion data of normal, fast and slow speech
In this paper, we describe and analyze a corpus of speech data that we have recorded in multiple modalities simultaneously: facial motion via optical motion capturing, tongue motion via electro-magnetic articulography, as well as conventional video and high-quality audio. The corpus consists of 320 phonetically diverse sentences uttered by a male Austrian German speaker at normal, fast and slow ...
Evaluating an Authentic Audio-Visual Expressive Speech Corpus
This paper presents an evaluation of the acted part of an audio-visual corpus of emotional speech. This corpus is intended to collect both spontaneous and acted emotions, and then the perceptive efficiency of stimuli to carry emotional expression has to be rated. The evaluation of acted speech is presented here, and will give us a scale to measure the spontaneous expressions.
Publication date: 2017